{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 浅层排序模型逻辑回归\n",
    "\n",
    "## LR(LogisticsRegression)背景知识介绍\n",
    "\n",
    "### 点击率预估与分类模型\n",
    "\n",
    "- 点击率预估用来预测用户在当前上下文对item的点击，根据点击率预估对item展现进行排序；\n",
    "\n",
    "- 点击率这里是个二分类问题，点了就是1，没点就是0，计算接近1的概率，接入预测点击率；\n",
    "\n",
    "### 什么是LR\n",
    "\n",
    "- 对回归模型的结果进行处理，使之成为0，1这样的标签\n",
    "\n",
    "### Sigmoid 函数\n",
    "\n",
    "- 阶跃函数\n",
    "\n",
    "### 训练流程\n",
    "\n",
    "- 从日志中获取训练样本与特征\n",
    "\n",
    "- 模型的参数学习\n",
    "\n",
    "- 模型预测\n",
    "\n",
    "### 优缺点\n",
    "\n",
    "#### 优点\n",
    "\n",
    "- 易于理解，计算代价小\n",
    "\n",
    "#### 缺点\n",
    "\n",
    "- 容易欠拟合，需要特征工程\n",
    "\n",
    "## LR算法数学原理解析\n",
    "\n",
    "### Sigmoid 函数\n",
    "\n",
    "$$f(x) = \\frac{1}{1 + e^{-x}}$$\n",
    "\n",
    "导数：\n",
    "\n",
    "$$f'(x) = \\frac{e^{-x}}{(1 + e^{-x})^2}$$\n",
    "\n",
    "可以看出，对于这个函数\n",
    "\n",
    "$$f'(x) = 1 - f(x)$$\n",
    "\n",
    "### LR数学模型\n",
    "\n",
    "$$w = w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n$$\n",
    "\n",
    "$$y = sigmoid(w)$$\n",
    "\n",
    "### 损失函数\n",
    "\n",
    "$$loss = log\\prod_{i=1}^n p(y_i|x_i)$$\n",
    "\n",
    "$$p(y_i|x_i) = h_w(x_i)^{y_i} (1 - h_w(x_i))^{1-y_i}$$\n",
    "\n",
    "对于单个样本\n",
    "\n",
    "$$loss = -(y_i log h_w(x_i) + (1 - y_i)log(1 - h_w(x_i)))$$\n",
    "\n",
    "### 梯度下降\n",
    "\n",
    "$$\\frac{\\partial loss}{\\partial w_j} = (h_w(x_i) - y_i)x_i^j$$\n",
    "\n",
    "$$w_j = w_j - \\alpha \\frac{\\partial loss}{\\partial w_j}$$\n",
    "\n",
    "### 正则化\n",
    "\n",
    "#### L1正则化\n",
    "\n",
    "$$lossNew = loss + \\alpha \\sum_{i=1}^n|w_j|$$\n",
    "\n",
    "#### L2正则化\n",
    "\n",
    "$$lossNew = loss + \\alpha |w|^2$$\n",
    "\n",
    "## 样本选择与特征构建的基本知识\n",
    "\n",
    "### 样本选择\n",
    "\n",
    "#### 样本选择规则\n",
    "\n",
    "- 采样比例，正负样本要维持合适的比例，比例根据产品特点决定\n",
    "\n",
    "- 采样率，可以使用随机采样方法\n",
    "\n",
    "#### 样本过滤规则\n",
    "\n",
    "- 结合业务情况过滤（去除爬虫请求，去除测试用数据，去除作弊数据，根据目标选取等等）\n",
    "\n",
    "- 异常点的过滤（统计学方法过滤）\n",
    "\n",
    "### 特征\n",
    "\n",
    "#### 类型\n",
    "\n",
    "- 连续型（不可穷举，如视频观看时长）\n",
    "\n",
    "- 离散型（可穷举，如学历）\n",
    "\n",
    "或\n",
    "\n",
    "- 低维度（如人的年龄、性别）\n",
    "\n",
    "- 高纬度（如这个人三十天内喜欢什么电影）\n",
    "\n",
    "或\n",
    "\n",
    "- 稳定特征（历史点击率）\n",
    "\n",
    "- 动态特征（天点击率）\n",
    "\n",
    "#### 特征的统计与分析\n",
    "\n",
    "- 获取难度\n",
    "\n",
    "- 覆盖率\n",
    "\n",
    "- 准确率\n",
    "\n",
    "#### 特征选择\n",
    "\n",
    "- 根据建模常识\n",
    "\n",
    "- 完成第一版特征后分解特征，如发现剪掉某个特征结果变好了，则应过滤掉该特征\n",
    "\n",
    "#### 特征预处理\n",
    "\n",
    "- 缺省值填充（众数等填充）\n",
    "\n",
    "- 归一化\n",
    "\n",
    "- 离散化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
